The dataset used in this project was taken from GitHub and is based on members of the superhero group the Avengers. Its columns are: Name.Alias, Appearances, Current, Gender, Probationary.Intro, Full.Reserve.Avengers.Intro, Year, Years.since.joining, Honorary, Death1 and Return1. The analyses in this project examine and visualize many of these variables.
In this section a contingency table and a mosaic plot are used to examine the disparity between female and male Avengers, and to determine what ratio of each has died throughout the series. Both variables, Gender and Death1, are categorical: Gender takes the values Female and Male, and Death1 takes the values Yes and No. The mosaic plot below shows that slightly more Male than Female Avengers have died throughout the series, though the difference between the two is not large.
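As a minimal sketch of how a contingency table for two categorical variables can be built, the following Python example counts each (Gender, Death1) combination and converts the counts to within-gender shares. The records are hypothetical values invented for illustration, not rows from the dataset, and the function names are my own.

```python
from collections import Counter

# Hypothetical (Gender, Death1) pairs; NOT taken from the real dataset.
records = [
    ("Male", "Yes"), ("Male", "No"), ("Male", "No"),
    ("Female", "Yes"), ("Female", "No"), ("Male", "Yes"),
    ("Female", "No"), ("Male", "No"),
]

def contingency_table(pairs):
    """Count how often each (row, column) combination occurs."""
    return Counter(pairs)

def row_proportions(table):
    """Convert counts to shares within each row category (here, each gender)."""
    row_totals = Counter()
    for (row, _), count in table.items():
        row_totals[row] += count
    return {key: count / row_totals[key[0]] for key, count in table.items()}

table = contingency_table(records)
proportions = row_proportions(table)

for gender in ("Female", "Male"):
    for died in ("Yes", "No"):
        key = (gender, died)
        print(gender, died, table[key], round(proportions[key], 2))
```

The within-gender shares are the quantities to compare when asking what ratio of each gender has died, since the two groups differ in size.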
The numerical variable examined in this section is Year, the year in which each Avenger joined the team. From the boxplot below we can see that the central portion of the data (the box) lies between 1978 and 2010, meaning that a large portion of Avengers joined within this period. The boxplot also has two whiskers, which indicate the spread of the data and show that some Avengers joined in years outside the central portion. The lower whisker ends at 1963, while the upper whisker ends at 2015. There is one outlier, the year 1900, in which an Avenger joined the team. Because this year lies far below the lower whisker, isolated from the rest of the data, it is marked as an outlier.
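The outlier status of 1900 can be checked with a short calculation. This sketch assumes that the box edges read from the plot (1978 and 2010) are the first and third quartiles, and that the boxplot uses the common 1.5 × IQR rule for its fences; neither assumption is stated in the report itself.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, lower, upper):
    """A value is an outlier when it falls outside the fences."""
    return value < lower or value > upper

# Box edges as read from the boxplot (assumed to be the quartiles).
q1, q3 = 1978, 2010
lower, upper = tukey_fences(q1, q3)
print(lower, upper)  # 1930.0 2058.0

for year in (1900, 1963, 2015):
    label = "outlier" if is_outlier(year, lower, upper) else "within fences"
    print(year, label)
```

Under these assumptions the IQR is 32 years, so the lower fence is 1978 - 48 = 1930. The year 1900 falls below it, while 1963 and 2015 fall inside the fences, which agrees with the whisker ends reported above.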
The plot below is a scatterplot of two numerical variables: Years since joining the Avengers (y-axis) and the number of Appearances (x-axis) each Avenger has made throughout the Cinematic Universe. I expected a strong positive correlation between these variables, as I thought that the longer an Avenger is on the team, the more Appearances they would accumulate. This assumption was wrong, however: there is no clear trend in the data, most of which is clustered in the lower left quadrant. The markers are also separated by Gender, with blue markers denoting Female Avengers and green markers denoting Male Avengers. By visual inspection we can see that there are more Male than Female Avengers on the team, and that the cluster of Avengers with the largest number of Appearances is also mostly Male. The lack of a positive correlation may be explained by how recently the Avengers became popular. Within the past 10-20 years a number of movies have been made, both about individual Avengers and about the entire team. Because this popularity is recent, few Avengers have both a large number of Appearances and many Years since joining, so the right quadrants of the graph are sparsely populated. There are still a few outliers, though, most likely from the movie(s) from 1900.
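The visual impression of "no clear trend" could be backed up with a number. As a minimal sketch, the following computes Pearson's correlation coefficient by hand; the two lists are hypothetical values made up for illustration, not the real Appearances and Years since joining columns.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical values; NOT taken from the real dataset.
appearances = [50, 120, 80, 1500, 300, 60, 90, 2000]
years_since_joining = [5, 30, 2, 50, 10, 40, 3, 45]

r = pearson_r(appearances, years_since_joining)
print(round(r, 3))
```

A value of r near 1 would indicate the strong positive correlation I originally expected, while a value near 0 would match the lack of trend seen in the scatterplot.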
The following histogram shows the distribution of the variable Years since joining. The graph is right-skewed, with a tail trailing off to the right. As noted earlier, this is most likely because the Avengers have recently risen in popularity, so more superheroes have joined the team recently; hence the lower frequencies in the right tail.
The variable Years since joining was used to demonstrate the Central Limit Theorem. The original distribution of the data is shown in the previous graph. In this section, random samples of sizes 10, 20, 30 and 40 were drawn repeatedly and the mean of each sample was recorded. The distributions of sample means for sizes 10, 20 and 30 show a slight right skew, similar to the original distribution of the data, but they are still unimodal and close to the normal distribution. The distribution for the final sample size, 40, is symmetric and is a good example of the approximately normal distribution that sample means follow as the sample size increases.
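The sampling-of-means procedure can be sketched as follows. This is only an illustration under stated assumptions: the population is a hypothetical right-skewed set of 200 values standing in for Years since joining, samples are drawn with replacement, and 1000 repetitions per sample size is an arbitrary choice, since the report does not state these details.

```python
import random
import statistics

def sample_means(population, sample_size, repetitions, rng):
    """Draw repeated random samples (with replacement) and return their means."""
    return [
        statistics.mean(rng.choices(population, k=sample_size))
        for _ in range(repetitions)
    ]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed population; NOT the real Years.since.joining column.
population = [rng.expovariate(1 / 20) for _ in range(200)]

spread_by_size = {}
for n in (10, 20, 30, 40):
    means = sample_means(population, n, repetitions=1000, rng=rng)
    spread_by_size[n] = statistics.stdev(means)
    print(n, round(statistics.mean(means), 2), round(spread_by_size[n], 2))
```

The means of the sample means stay close to the population mean for every sample size, while their spread shrinks as the sample size grows. Plotting a histogram of `means` for each size would show the shapes described above.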
This section uses three sampling methods to draw samples of size 50 from the full dataset using the Appearances variable. The first graph shows the distribution of the full dataset, which is right-skewed with a few outliers on the right side of the graph. The second graph uses simple random sampling without replacement. The third graph uses systematic sampling, where each value has an equal probability of being selected. The fourth graph uses inclusion probabilities, so that the probability of each value being selected depends on its size. We can observe in the graph below that each of the three sampling methods results in a histogram with a distribution similar to that of the full dataset, including comparable outlier values. This allows us to conclude that analysing smaller samples drawn from the full dataset is a viable option when doing data analyses, and can provide similar results.
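The three sampling methods can be sketched in Python as below. The population is a hypothetical list of 200 values standing in for Appearances, and all function names are my own. The report does not say which unequal-probability procedure was used, so systematic selection on size-proportional inclusion probabilities is an assumption here, and it only works as written when every inclusion probability is at most 1.

```python
import math
import random

def simple_random_sample(values, n, rng):
    """Simple random sampling without replacement."""
    return rng.sample(values, n)

def systematic_sample(values, n, rng):
    """Take every k-th value after a random start within the first interval."""
    k = len(values) // n
    start = rng.randrange(k)
    return [values[start + i * k] for i in range(n)]

def inclusion_probabilities(sizes, n):
    """Inclusion probability of each unit, proportional to its size."""
    total = sum(sizes)
    return [n * size / total for size in sizes]

def unequal_probability_sample(values, n, rng):
    """Systematic selection using size-proportional inclusion probabilities.

    Assumes every inclusion probability is at most 1.
    """
    probs = inclusion_probabilities(values, n)
    u = rng.random()
    chosen, cumulative = [], 0.0
    for value, p in zip(values, probs):
        previous = cumulative
        cumulative += p
        # A unit is selected when one of the points u, u + 1, u + 2, ...
        # falls inside its interval (previous, cumulative].
        if math.floor(cumulative - u) > math.floor(previous - u):
            chosen.append(value)
    return chosen

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 200 values; NOT the real Appearances column.
appearances = [1 + (i * 37) % 100 for i in range(200)]

srs = simple_random_sample(appearances, 50, rng)
systematic = systematic_sample(appearances, 50, rng)
unequal = unequal_probability_sample(appearances, 50, rng)
print(len(srs), len(systematic), len(unequal))
```

In the third method, values with more Appearances get a wider interval and are therefore more likely to be selected, which is the sense in which the selection probability depends on size.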
This final graph demonstrates a technique not mentioned in the instructions or taught in this course. A 2D histogram was constructed from the scatterplot in Part 2, which plots Years since joining against Appearances. The Gender variable was collapsed in the scatterplot used in this section to allow the 2D histogram to be constructed effectively. This new graph gives a qualitative, colour-based representation of where the data is clustered on the scatterplot. The scale on the right maps each colour to the frequency of markers in that region of the graph. In the lower left corner, where most of the data is clustered, there is a variety of colours indicating marker frequencies from about 30 to over 60. Lighter squares in the upper left quadrant indicate a lower marker frequency. Lastly, in the lower right quadrant we can see our outliers from the scatterplot highlighted in blue, indicating a marker frequency of ~20.
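The counting behind a 2D histogram can be sketched in a few lines: each point is assigned to a rectangular bin, and the number of points per bin is what the colour scale displays. The data and bin widths below are hypothetical choices for illustration, not the values used for the graph.

```python
from collections import Counter

def histogram_2d(xs, ys, x_width, y_width):
    """Count points per rectangular bin; keys are (x_bin, y_bin) indices."""
    return Counter(
        (int(x // x_width), int(y // y_width)) for x, y in zip(xs, ys)
    )

# Hypothetical values; NOT taken from the real dataset.
appearances = [50, 120, 80, 1500, 300, 60, 90, 2000]
years_since_joining = [5, 30, 2, 50, 10, 40, 3, 45]

counts = histogram_2d(appearances, years_since_joining, x_width=500, y_width=10)
for bin_index, count in sorted(counts.items()):
    print(bin_index, count)
```

Bins with high counts correspond to the densely clustered lower left corner of the scatterplot, while isolated points end up in bins with a count of one.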